Fowlkes–Mallows Score (fowlkes_mallows_score)#

The Fowlkes–Mallows index (FMI) measures how similar two labelings of the same \(n\) samples are.

It is most often used for external clustering validation:

  • labels_true: ground-truth classes (or a reference clustering)

  • labels_pred: cluster assignments produced by an algorithm

FMI is a pair-counting metric: it looks at all pairs (i, j) and checks whether each pair is placed in the same cluster in both labelings.

Goals#

  • Build intuition with tiny examples + pairwise heatmaps.

  • Derive the FMI formula (pair precision/recall) and compute it via a contingency matrix.

  • Implement FMI from scratch in NumPy and match scikit-learn.

  • See how FMI reacts to permutations, merges/splits, and label noise.

  • Use FMI as a model-selection objective for a simple logistic regression classifier.

Quick import#

from sklearn.metrics import fowlkes_mallows_score
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.datasets import make_blobs, make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import fowlkes_mallows_score as sk_fowlkes_mallows_score

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)
import sys

import sklearn
import plotly

print("python:", sys.version.split()[0])
print("numpy :", np.__version__)
print("sklearn:", sklearn.__version__)
print("plotly:", plotly.__version__)
python: 3.12.9
numpy : 1.26.2
sklearn: 1.6.0
plotly: 6.5.2

1) Pair-counting view (what FMI actually measures)#

Let \(y^{(T)}\) be the true labels and \(y^{(P)}\) be the predicted labels.

For each pair of samples \((i, j)\) with \(i<j\), define:

  • same-true: \(\mathbb{1}[y^{(T)}_i = y^{(T)}_j]\)

  • same-pred: \(\mathbb{1}[y^{(P)}_i = y^{(P)}_j]\)

This turns clustering comparison into a binary decision per pair (together vs apart), yielding pairwise counts:

  • TP: together in both true and predicted

  • FP: together in predicted, apart in true

  • FN: together in true, apart in predicted

  • TN: apart in both (not used by FMI)

Pairwise precision and recall:

\[ \text{precision}_{\text{pair}} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},\qquad \text{recall}_{\text{pair}} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \]

The Fowlkes–Mallows index is the geometric mean:

\[ \mathrm{FMI} = \sqrt{\text{precision}_{\text{pair}}\,\text{recall}_{\text{pair}}} = \frac{\mathrm{TP}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})}} \]

Notes:

  • FMI is label-permutation invariant (cluster IDs are arbitrary).

  • FMI ignores true negatives (pairs separated in both clusterings). With many clusters, TN pairs dominate, so ignoring them keeps the metric focused on who gets grouped together.
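
The permutation-invariance point is easy to confirm directly. A minimal sketch (the labelings below are made up for illustration): renaming cluster IDs leaves the score unchanged, because only the grouping of pairs matters.

```python
import numpy as np
from sklearn.metrics import fowlkes_mallows_score

y_true = np.array([0, 0, 1, 1, 2, 2])
y_renamed = np.array([2, 2, 0, 0, 1, 1])  # same grouping, cluster IDs renamed

# Identical pairwise grouping -> perfect score, regardless of the ID names.
print(fowlkes_mallows_score(y_true, y_renamed))  # 1.0
```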

def pair_confusion_counts_bruteforce(labels_true, labels_pred):
    """Pair confusion counts (TP, FP, FN, TN) in O(n^2).

    A pair (i, j) with i<j is a 'positive' if the two samples are in the same cluster.
    """

    labels_true = np.asarray(labels_true).ravel()
    labels_pred = np.asarray(labels_pred).ravel()
    if labels_true.shape != labels_pred.shape:
        raise ValueError(f"shape mismatch: true{labels_true.shape} vs pred{labels_pred.shape}")

    n = labels_true.size
    tp = fp = fn = tn = 0

    for i in range(n - 1):
        for j in range(i + 1, n):
            same_true = labels_true[i] == labels_true[j]
            same_pred = labels_pred[i] == labels_pred[j]

            if same_true and same_pred:
                tp += 1
            elif (not same_true) and same_pred:
                fp += 1
            elif same_true and (not same_pred):
                fn += 1
            else:
                tn += 1

    return tp, fp, fn, tn


def fowlkes_mallows_from_pair_counts(tp, fp, fn):
    den = (tp + fp) * (tp + fn)
    if den == 0:
        return 0.0
    return float(tp / np.sqrt(den))


# Tiny example: 8 items, 3 true clusters
labels_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])

# A predicted clustering that *merges* clusters 1 and 2
labels_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1])

tp, fp, fn, tn = pair_confusion_counts_bruteforce(labels_true, labels_pred)

precision_pair = tp / (tp + fp) if (tp + fp) else 0.0
recall_pair = tp / (tp + fn) if (tp + fn) else 0.0
fmi = fowlkes_mallows_from_pair_counts(tp, fp, fn)

n = labels_true.size
pairs_total = n * (n - 1) // 2

print(f"n={n}, total pairs={pairs_total}")
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"pair-precision={precision_pair:.3f}, pair-recall={recall_pair:.3f}")
print(f"FMI (pair counts)={fmi:.3f}")
print(f"FMI (sklearn)     ={sk_fowlkes_mallows_score(labels_true, labels_pred):.3f}")
n=8, total pairs=28
TP=7, FP=6, FN=0, TN=15
pair-precision=0.538, pair-recall=1.000
FMI (pair counts)=0.734
FMI (sklearn)     =0.734
def same_cluster_matrix(labels):
    labels = np.asarray(labels).ravel()
    return labels[:, None] == labels[None, :]


same_true = same_cluster_matrix(labels_true)
same_pred = same_cluster_matrix(labels_pred)

# Encode each (i,j) pair into a category:
# -1 = diagonal, 0 = TN, 1 = FN, 2 = FP, 3 = TP
pair_type = np.zeros_like(same_true, dtype=int)
pair_type[(same_true) & (same_pred)] = 3
pair_type[(~same_true) & (same_pred)] = 2
pair_type[(same_true) & (~same_pred)] = 1
pair_type[(~same_true) & (~same_pred)] = 0
np.fill_diagonal(pair_type, -1)

colorscale_pair = [
    [0.00, "#e0e0e0"],
    [0.125, "#e0e0e0"],  # -1 diag
    [0.125, "#ffffff"],
    [0.375, "#ffffff"],  # 0 TN
    [0.375, "#d73027"],
    [0.625, "#d73027"],  # 1 FN
    [0.625, "#4575b4"],
    [0.875, "#4575b4"],  # 2 FP
    [0.875, "#1a9850"],
    [1.00, "#1a9850"],  # 3 TP
]

fig = make_subplots(
    rows=1,
    cols=3,
    subplot_titles=[
        "Same cluster? (true)",
        "Same cluster? (pred)",
        "Pair types (TP/FP/FN/TN)",
    ],
)

fig.add_trace(
    go.Heatmap(z=same_true.astype(int), colorscale="Greys", showscale=False),
    row=1,
    col=1,
)
fig.add_trace(
    go.Heatmap(z=same_pred.astype(int), colorscale="Greys", showscale=False),
    row=1,
    col=2,
)

fig.add_trace(
    go.Heatmap(
        z=pair_type,
        zmin=-1,
        zmax=3,
        colorscale=colorscale_pair,
        colorbar=dict(
            title="pair",
            tickvals=[-1, 0, 1, 2, 3],
            ticktext=["diag", "TN", "FN", "FP", "TP"],
        ),
    ),
    row=1,
    col=3,
)

fig.update_layout(
    title="FMI compares clusterings by counting sample pairs",
    height=380,
)
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()

2) Efficient computation via a contingency matrix#

The brute-force pair loop is \(O(n^2)\).

A much faster way uses the contingency matrix \(N\):

  • rows = true clusters

  • columns = predicted clusters

  • \(N_{ij}\) = number of samples that are in true cluster \(i\) and predicted cluster \(j\)

Let \(\binom{m}{2} = \frac{m(m-1)}{2}\) be the number of unordered pairs inside a group of size \(m\).

Then:

\[ \mathrm{TP} = \sum_{i,j} \binom{N_{ij}}{2} \]

Also define row and column sums:

\[ a_i = \sum_j N_{ij}\quad (\text{size of true cluster } i),\qquad b_j = \sum_i N_{ij}\quad (\text{size of predicted cluster } j) \]

Pairs placed together by the predicted clustering:

\[ \mathrm{TP}+\mathrm{FP} = \sum_j \binom{b_j}{2} \]

Pairs placed together by the true clustering:

\[ \mathrm{TP}+\mathrm{FN} = \sum_i \binom{a_i}{2} \]

So:

\[ \mathrm{FMI} = \frac{\sum_{i,j} \binom{N_{ij}}{2}} {\sqrt{\left(\sum_j \binom{b_j}{2}\right)\left(\sum_i \binom{a_i}{2}\right)}} \]

This avoids enumerating all pairs.

def comb2(m):
    m = np.asarray(m, dtype=np.int64)
    return m * (m - 1) // 2


def contingency_matrix_numpy(labels_true, labels_pred):
    labels_true = np.asarray(labels_true).ravel()
    labels_pred = np.asarray(labels_pred).ravel()
    if labels_true.shape != labels_pred.shape:
        raise ValueError(f"shape mismatch: true{labels_true.shape} vs pred{labels_pred.shape}")

    _, t = np.unique(labels_true, return_inverse=True)
    _, p = np.unique(labels_pred, return_inverse=True)

    n_true = int(t.max()) + 1 if t.size else 0
    n_pred = int(p.max()) + 1 if p.size else 0

    cm = np.zeros((n_true, n_pred), dtype=np.int64)
    np.add.at(cm, (t, p), 1)
    return cm


cm = contingency_matrix_numpy(labels_true, labels_pred)

tp_fast = int(comb2(cm).sum())
pred_pairs = int(comb2(cm.sum(axis=0)).sum())  # TP + FP
true_pairs = int(comb2(cm.sum(axis=1)).sum())  # TP + FN

fmi_fast = 0.0 if (pred_pairs == 0 or true_pairs == 0) else float(tp_fast / np.sqrt(pred_pairs * true_pairs))

print("contingency matrix N (rows=true, cols=pred):\n", cm)
print(f"TP={tp_fast}, TP+FP={pred_pairs}, TP+FN={true_pairs}")
print("FMI (from contingency) =", round(fmi_fast, 6))
contingency matrix N (rows=true, cols=pred):
 [[3 0]
 [0 2]
 [0 3]]
TP=7, TP+FP=13, TP+FN=7
FMI (from contingency) = 0.733799
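
As a sanity check, the contingency-matrix identities can be verified against brute-force pair counting on a random labeling. This sketch is deliberately self-contained (it re-implements both routes locally rather than reusing the helpers above):

```python
import numpy as np
from math import comb

def pair_counts_bruteforce(yt, yp):
    # O(n^2) reference: classify every unordered pair (i, j), i < j.
    tp = fp = fn = tn = 0
    n = len(yt)
    for i in range(n - 1):
        for j in range(i + 1, n):
            same_t = yt[i] == yt[j]
            same_p = yp[i] == yp[j]
            tp += same_t and same_p
            fp += (not same_t) and same_p
            fn += same_t and (not same_p)
            tn += (not same_t) and (not same_p)
    return int(tp), int(fp), int(fn), int(tn)

rng_check = np.random.default_rng(0)
yt = rng_check.integers(0, 4, 60)
yp = rng_check.integers(0, 3, 60)

# Contingency route: build N_ij, then sum binom(., 2) over cells / rows / cols.
_, t = np.unique(yt, return_inverse=True)
_, p = np.unique(yp, return_inverse=True)
cm = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
np.add.at(cm, (t, p), 1)

tp_cm = sum(comb(int(m), 2) for m in cm.ravel())
pred_pairs = sum(comb(int(m), 2) for m in cm.sum(axis=0))  # TP + FP
true_pairs = sum(comb(int(m), 2) for m in cm.sum(axis=1))  # TP + FN

tp, fp, fn, _ = pair_counts_bruteforce(yt, yp)
print(tp_cm == tp, pred_pairs == tp + fp, true_pairs == tp + fn)  # True True True
```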

3) NumPy implementation (from scratch)#

Below is a minimal implementation that mirrors the scikit-learn definition.

Implementation notes:

  • We use the contingency-matrix formula (fast, no \(O(n^2)\) loops).

  • If either labeling has no within-cluster pairs (all singleton clusters), scikit-learn returns 0.0.

def fowlkes_mallows_score_numpy(labels_true, labels_pred):
    """Fowlkes–Mallows score between two labelings.

    Parameters
    ----------
    labels_true, labels_pred : array-like, shape (n_samples,)
        Two labelings of the same samples. Values can be any hashable type.

    Returns
    -------
    fmi : float
        In [0, 1]. 1 means identical pairwise grouping.
    """

    cm = contingency_matrix_numpy(labels_true, labels_pred)

    tp = int(comb2(cm).sum())
    pred_pairs = int(comb2(cm.sum(axis=0)).sum())  # TP + FP
    true_pairs = int(comb2(cm.sum(axis=1)).sum())  # TP + FN

    if pred_pairs == 0 or true_pairs == 0:
        return 0.0

    return float(tp / np.sqrt(pred_pairs * true_pairs))
# Quick correctness checks vs scikit-learn

def check_case(name, y_t, y_p):
    a = fowlkes_mallows_score_numpy(y_t, y_p)
    b = sk_fowlkes_mallows_score(y_t, y_p)
    ok = np.isclose(a, b)
    print(f"{name:28s} numpy={a:.6f}  sklearn={b:.6f}  ok={ok}")
    return ok


ok_all = True
ok_all &= check_case("perfect (permute labels)", [0, 0, 1, 1], [1, 1, 0, 0])
ok_all &= check_case("all singletons", np.arange(6), np.arange(6))
ok_all &= check_case("all same cluster", np.zeros(6, dtype=int), np.zeros(6, dtype=int))
ok_all &= check_case("random labels", rng.integers(0, 4, 80), rng.integers(0, 5, 80))

# A few randomized trials
for _ in range(10):
    y_t = rng.integers(0, rng.integers(2, 8), size=200)
    y_p = rng.integers(0, rng.integers(2, 9), size=200)
    ok_all &= np.isclose(fowlkes_mallows_score_numpy(y_t, y_p), sk_fowlkes_mallows_score(y_t, y_p))

print("\nAll checks passed?", ok_all)

4) Behavior: permutations, merges/splits, and label noise#

We’ll create a 2D dataset with obvious clusters and compare several predicted labelings:

  • perfect: identical to the ground truth

  • permuted: same clustering, different numeric IDs (should score 1.0)

  • merged: two clusters merged into one (more false positives)

  • split: one cluster split into two (more false negatives)

  • noisy: a fraction of labels randomly corrupted

X, y_true = make_blobs(
    n_samples=450,
    centers=3,
    cluster_std=(0.9, 1.1, 0.8),
    random_state=7,
)

# Different predicted labelings

y_perfect = y_true.copy()

y_permuted = (y_true + 1) % 3

y_merged = y_true.copy()
y_merged[y_merged == 2] = 1  # merge cluster 2 into cluster 1

# Split cluster 0 into two by x-coordinate
mask0 = y_true == 0
x0 = X[mask0, 0]
cut = np.median(x0)
y_split = y_true.copy()
y_split[mask0 & (X[:, 0] > cut)] = 3

# Add label noise
noise_frac = 0.12
idx = rng.choice(X.shape[0], size=int(noise_frac * X.shape[0]), replace=False)
y_noisy = y_true.copy()
y_noisy[idx] = rng.integers(0, 4, size=idx.size)

noisy_key = f"noisy ({noise_frac:.0%})"

cases = {
    "true": y_true,
    "perfect": y_perfect,
    "permuted": y_permuted,
    "merged": y_merged,
    "split": y_split,
    noisy_key: y_noisy,
}

scores = {name: fowlkes_mallows_score_numpy(y_true, y) for name, y in cases.items() if name != "true"}

for name, s in scores.items():
    print(f"{name:14s} FMI={s:.4f}")
# Visualize true labels and a few predicted labelings

palette = px.colors.qualitative.Set2


def label_colors(labels):
    labels = np.asarray(labels)
    uniq = np.unique(labels)
    color_map = {int(k): palette[i % len(palette)] for i, k in enumerate(uniq)}
    return [color_map[int(k)] for k in labels]


fig = make_subplots(
    rows=2,
    cols=3,
    subplot_titles=[
        "ground truth",
        f"perfect (FMI={scores['perfect']:.3f})",
        f"permuted (FMI={scores['permuted']:.3f})",
        f"merged (FMI={scores['merged']:.3f})",
        f"split (FMI={scores['split']:.3f})",
        f"noisy (FMI={scores[noisy_key]:.3f})",
    ],
)

panels = [
    ("true", y_true),
    ("perfect", y_perfect),
    ("permuted", y_permuted),
    ("merged", y_merged),
    ("split", y_split),
    ("noisy", y_noisy),
]

for k, (_, y) in enumerate(panels):
    r = 1 + k // 3
    c = 1 + k % 3
    fig.add_trace(
        go.Scatter(
            x=X[:, 0],
            y=X[:, 1],
            mode="markers",
            marker=dict(size=6, color=label_colors(y), opacity=0.85),
            showlegend=False,
        ),
        row=r,
        col=c,
    )

fig.update_layout(height=650, title="Same points, different labelings → different FMI")
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()
# FMI as we increase label noise

def corrupt_labels(y, frac, *, rng):
    y = np.asarray(y).copy()
    n = y.size
    m = int(frac * n)
    if m == 0:
        return y

    idx = rng.choice(n, size=m, replace=False)
    labels = np.unique(y)
    y[idx] = rng.choice(labels, size=m)
    return y


noise_grid = np.linspace(0, 0.8, 41)
fmis = []
for frac in noise_grid:
    y_corrupt = corrupt_labels(y_true, frac, rng=rng)
    fmis.append(fowlkes_mallows_score_numpy(y_true, y_corrupt))

fig = go.Figure()
fig.add_trace(go.Scatter(x=noise_grid, y=fmis, mode="lines+markers"))
fig.update_layout(
    title="FMI decreases as we corrupt more labels",
    xaxis_title="fraction of labels randomly reassigned",
    yaxis_title="Fowlkes–Mallows index",
)
fig.show()

5) Using FMI to tune a simple model (logistic regression)#

FMI expects hard labels. A classifier like logistic regression produces scores (probabilities), and we choose a threshold \(t\):

\[ \hat{y}_i(t) = \mathbb{1}[\hat{p}_i \ge t] \]

Two important practical points:

  • FMI is not differentiable w.r.t. model parameters because it depends on discrete predictions.

  • A common workflow is:

    1. fit the model with a differentiable loss (e.g. log loss)

    2. select hyperparameters / thresholds by maximizing FMI on a validation set (grid search)

Even though FMI is more common for clustering, it is defined for any pair of labelings — so it can also compare class labels.
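
To make the thresholding step concrete before fitting anything, here is a tiny sketch with made-up probabilities (`p_hat` is hypothetical, not produced by a model): sweeping \(t\) changes the hard labels, and hence the FMI.

```python
import numpy as np
from sklearn.metrics import fowlkes_mallows_score

y_val = np.array([0, 0, 1, 1, 1, 0])
p_hat = np.array([0.20, 0.40, 0.90, 0.70, 0.55, 0.30])  # hypothetical scores

for t in (0.3, 0.5, 0.6):
    y_hat = (p_hat >= t).astype(int)  # hard labels at threshold t
    print(f"t={t:.1f}  FMI={fowlkes_mallows_score(y_val, y_hat):.3f}")
```

Here \(t = 0.5\) happens to reproduce `y_val` exactly, so it scores 1.0; the other thresholds misplace some pairs and score lower.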

def add_intercept(X: np.ndarray) -> np.ndarray:
    X = np.asarray(X, dtype=float)
    return np.c_[np.ones((X.shape[0], 1)), X]


def sigmoid(z):
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out


def log_loss_from_proba(y_true, p, eps=1e-15):
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


def fit_logistic_regression_gd(
    X,
    y,
    *,
    lr=0.2,
    max_iter=3000,
    alpha=0.0,
    tol=1e-8,
):
    """Binary logistic regression with gradient descent + optional L2 penalty."""

    Xb = add_intercept(X)
    y = np.asarray(y, dtype=float).ravel()

    n, d = Xb.shape
    w = np.zeros(d)
    history = []

    for _ in range(max_iter):
        p = sigmoid(Xb @ w)
        loss = log_loss_from_proba(y, p) + 0.5 * alpha * np.sum(w[1:] ** 2)
        history.append(loss)

        grad = (Xb.T @ (p - y)) / n
        grad[1:] += alpha * w[1:]

        w_new = w - lr * grad
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new

    return w, np.asarray(history)


def predict_proba_logreg(X, w):
    return sigmoid(add_intercept(X) @ w)


# A 2D dataset so we can visualize the decision boundary.
X, y = make_classification(
    n_samples=900,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    class_sep=1.1,
    flip_y=0.05,
    random_state=7,
)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.35, random_state=7, stratify=y
)

w, history = fit_logistic_regression_gd(X_train, y_train, lr=0.25, max_iter=4000, alpha=1e-2)

fig = go.Figure()
fig.add_trace(go.Scatter(y=history, mode="lines"))
fig.update_layout(title="Logistic regression training loss (log loss)", xaxis_title="iteration", yaxis_title="loss")
fig.show()
# Sweep thresholds and choose the one that maximizes FMI on the validation set

p_val = predict_proba_logreg(X_val, w)

t_grid = np.linspace(0.01, 0.99, 199)
fmi_grid = []

for t in t_grid:
    y_hat = (p_val >= t).astype(int)
    fmi_grid.append(fowlkes_mallows_score_numpy(y_val, y_hat))

fmi_grid = np.asarray(fmi_grid)

best_idx = int(np.argmax(fmi_grid))
best_t = float(t_grid[best_idx])
best_fmi = float(fmi_grid[best_idx])

print(f"best threshold t*={best_t:.3f} -> FMI={best_fmi:.4f}")

fig = go.Figure()
fig.add_trace(go.Scatter(x=t_grid, y=fmi_grid, mode="lines+markers", name="FMI(t)"))
fig.add_vline(x=0.5, line_dash="dash", line_color="gray")
fig.add_vline(x=best_t, line_dash="dash", line_color="black")
fig.update_layout(
    title="Choose a decision threshold by maximizing FMI",
    xaxis_title="threshold t",
    yaxis_title="Fowlkes–Mallows index",
)
fig.show()
# Visualize how changing the threshold shifts the decision boundary

fig = go.Figure()

# points
fig.add_trace(
    go.Scatter(
        x=X_val[:, 0],
        y=X_val[:, 1],
        mode="markers",
        marker=dict(size=6, color=y_val, colorscale=[[0, "#4575b4"], [1, "#d73027"]], opacity=0.8),
        name="validation points",
    )
)

w0, w1, w2 = w


def decision_line_xy(w0, w1, w2, threshold, x1_grid):
    # p = sigmoid(w0 + w1*x1 + w2*x2) >= t  <=>  w0 + w1*x1 + w2*x2 >= log(t/(1-t))
    logit_t = np.log(threshold / (1 - threshold))
    if abs(w2) < 1e-12:
        return None
    x2 = (logit_t - w0 - w1 * x1_grid) / w2
    return x2


x1_min, x1_max = X_val[:, 0].min() - 0.5, X_val[:, 0].max() + 0.5
x1_grid = np.linspace(x1_min, x1_max, 200)

x2_05 = decision_line_xy(w0, w1, w2, 0.5, x1_grid)
x2_best = decision_line_xy(w0, w1, w2, best_t, x1_grid)

if x2_05 is not None:
    fig.add_trace(
        go.Scatter(
            x=x1_grid,
            y=x2_05,
            mode="lines",
            line=dict(color="gray", dash="dash"),
            name="boundary t=0.5",
        )
    )

if x2_best is not None:
    fig.add_trace(
        go.Scatter(
            x=x1_grid,
            y=x2_best,
            mode="lines",
            line=dict(color="black", dash="dash"),
            name=f"boundary t*={best_t:.3f}",
        )
    )

fig.update_layout(
    title="Threshold choice changes which points are predicted positive",
    xaxis_title="x1",
    yaxis_title="x2",
)
fig.show()

6) Pros, cons, and when to use FMI#

Pros

  • Permutation invariant: cluster IDs can be renamed without changing the score.

  • Interpretable as \(\sqrt{\text{pair-precision}\cdot\text{pair-recall}}\).

  • Works with different numbers of clusters in the two labelings.

  • Bounded in \([0,1]\) (higher is better).

Cons / pitfalls

  • Not adjusted for chance: random labelings can score non-zero depending on cluster sizes (compare with adjusted_rand_score).

  • Ignores TN pairs (pairs separated in both). That is often desirable, but it means FMI is not a full “pair accuracy” metric.

  • Can be unintuitive when clusters are tiny: if a labeling has all singleton clusters, there are no within-cluster pairs → FMI is 0.0 (even if the two labelings match).

  • FMI is non-differentiable w.r.t. model parameters (it depends on discrete assignments), so you typically use it for model selection (grid search), not direct gradient optimization.
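
The chance-adjustment pitfall is easy to see on two independent random labelings (a small sketch; with 3 roughly equal clusters, FMI lands near 1/3 while the adjusted Rand index hovers near 0):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score

rng_demo = np.random.default_rng(0)
y_a = rng_demo.integers(0, 3, 300)  # two independent random 3-cluster labelings
y_b = rng_demo.integers(0, 3, 300)

print(f"FMI = {fowlkes_mallows_score(y_a, y_b):.3f}")  # noticeably above 0
print(f"ARI = {adjusted_rand_score(y_a, y_b):.3f}")    # close to 0
```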

Good use cases

  • External clustering evaluation when you have ground truth labels.

  • Comparing two different clusterings of the same dataset (algorithm comparison, stability across runs).

  • Selecting clustering hyperparameters (e.g. choosing a cut level in hierarchical clustering) when a reference labeling exists.

Not a good fit

  • Purely unsupervised evaluation (no reference labels): consider internal metrics like silhouette_score or davies_bouldin_score.

  • When you care about per-class errors, or about sample-level false positives vs false negatives: use classification metrics like F1 or ROC-AUC.

7) Exercises#

  1. Construct a case where one clustering is a refinement of the other (every predicted cluster is a subset of a true cluster). What happens to pair precision vs pair recall?

  2. Compare FMI vs adjusted_rand_score on random labelings as you vary the number of clusters.

  3. In the logistic regression example, compare the threshold that maximizes FMI vs the threshold that maximizes F1.

References#

  • Fowlkes, E. B., & Mallows, C. L. (1983). A Method for Comparing Two Hierarchical Clusterings. Journal of the American Statistical Association.

  • scikit-learn: sklearn.metrics.fowlkes_mallows_score documentation.